Techniques to generate and store graph models from structured and unstructured data in a cloud-based graph database system

ABSTRACT

Embodiments include systems, methods, articles of manufacture, and computer-readable media configured to process data in a structured format and an unstructured format and to apply one or more algorithms to detect elements and links between the elements in the data. Embodiments are further configured to generate a graph model comprising nodes comprising the elements and edges comprising the links.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to and is a continuation of U.S. patent application Ser. No. 17/089,418, entitled “TECHNIQUES TO GENERATE AND STORE GRAPH MODELS FROM STRUCTURED AND UNSTRUCTURED DATA IN A CLOUD-BASED GRAPH DATABASE SYSTEM” filed on Nov. 4, 2020. The contents of the aforementioned application are incorporated herein by reference in their entirety.

BACKGROUND

Enterprise systems have typically been implemented on a localized system and store data in one or more relational databases. These databases include highly-structured tables that require developers and applications to strictly structure the data used in their applications. Moreover, the rigidity of such databases generally requires operations that are compute-heavy and memory-intensive and have an exponential cost. Other database structures, such as NoSQL structures, have been utilized but are generally simple and have limited operations. Further, both relational and NoSQL structures miss the importance of connections between data, which is critical when trying to detect related issues that are not apparent at first glance.

Current trends include utilizing a graph model structure or a graph database to store data that is highly connected to provide flexibility in adding data, running faster relationship-based searches, and indexing by relationships. However, current graph databases often fail to create better relationships between data or elements because they rely on a user or administrator to provide a graph model to define the relationships. Embodiments discussed herein solve these problems.

BRIEF SUMMARY

Embodiments may be generally directed to techniques and systems, including a storage configured to store graph databases and one or more processors coupled with the storage. The systems may also include memory coupled with the storage and the one or more processors, the memory to store instructions that, when executed by the one or more processors, cause the one or more processors to obtain first data comprising a first set of elements in a structured format and first connection information, obtain second data comprising one or more text segments, the one or more text segments comprising potential elements, and apply a named entity recognition analysis to the second data to detect a second set of elements and second connection information, the second set of elements detected from the potential elements and the second connection information to indicate links between one or more of the second set of elements. The instructions may further cause the one or more processors to generate a graph model comprising nodes and edges from the first set of elements and the second set of elements, wherein each node comprises an element from the first set of elements or the second set of elements, and each edge is to link one of the nodes to another one of the nodes based on the first connection information, the second connection information, or a combination thereof, and store the graph model in a graph database in the storage.

Embodiments also include a computer-implemented method comprising obtaining structured data comprising node information and connection information, the connection information to indicate links between a first set of elements of the node information, obtaining unstructured data comprising one or more text segments, the one or more text segments comprising potential elements, and applying a named entity recognition analysis to the unstructured data to detect a second set of elements and additional connection information, the second set of elements detected from the potential elements and the additional connection information to indicate links between one or more of the second set of elements. The method also includes generating a graph model comprising nodes and edges from the first set of elements and the second set of elements, wherein each node comprises an element from the first set of elements or the second set of elements, and each edge is to link one of the nodes to another one of the nodes based on the connection information, the additional connection information, or a combination thereof, and storing the graph model in a graph database in a storage.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

To easily identify the discussion of any particular element or act, the most significant digit or digits in a reference number refer to the figure number in which that element is first introduced.

FIG. 1 illustrates an example of a system 100 in accordance with embodiments.

FIG. 2 illustrates a processing flow 200 in accordance with embodiments discussed herein.

FIG. 3 illustrates a graph model 300 in accordance with embodiments.

FIG. 4 illustrates a processing flow 400 in accordance with embodiments.

FIG. 5 illustrates a diagram 500 in accordance with embodiments.

FIG. 6 illustrates a processing flow 600 in accordance with embodiments.

FIG. 7 illustrates a diagram 700 in accordance with embodiments.

FIG. 8 illustrates a processing flow 800 in accordance with embodiments.

FIG. 9 illustrates a graph model 900 in accordance with embodiments.

FIG. 10 illustrates a computer architecture 1000 in accordance with embodiments.

FIG. 11 illustrates a communications architecture 1100 in accordance with embodiments.

DETAILED DESCRIPTION

FIG. 1 illustrates an example of a system 100 that may operate in accordance with embodiments discussed herein. FIG. 1 illustrates system 100 in a simplified manner, and system 100 may include a number of components not illustrated in FIG. 1, including additional systems, computing devices, networking equipment, processors, memory, storage, interfaces, connecting mediums (wireless and/or wired), and so forth.

System 100 includes a system 102, which may be an enterprise system including a number of servers, computers, computing devices, processors, memory, storage, etc. to integrate a number of systems, applications, protocols, and formats. In one example, an enterprise system enables a business to process data to support particular functions or processes. The data processing may include analyzing data and providing the data in a format to enable users to make decisions. For example, the enterprise system may process data to support an audit department performing audit tasks and risk management.

Historically, an enterprise system stored data in one or more databases in highly-structured tables with predetermined columns of specific types and many rows of those defined types, e.g., a relational database format. Due to the rigidity of their organization, relational databases require developers and applications to strictly structure the data used in their applications. In relational databases, references to other rows and tables are indicated by referring to primary key attributes via foreign key columns. Joins are computed at query time by matching primary and foreign keys of all rows in the connected tables. These operations are compute-heavy and memory-intensive and have an exponential cost. When many-to-many relationships occur in the model, a JOIN table (or associative entity table) that holds foreign keys of both the participating tables must be introduced, further increasing operation costs. Those types of costly join operations are often addressed by denormalizing the data to reduce the number of joins necessary, thereby breaking the data integrity of a relational database. Other database structures have been used, such as the NoSQL structures, which are aggregate-oriented and group data based on a particular criterion. These models generally provide simple and limited operations. Both relational and NoSQL structures miss the importance of connections between data, which is critical when trying to detect related issues that are not apparent at first glance.

Current trends include utilizing a graph model structure or a graph database to store data that is highly connected to provide flexibility in adding data, running faster relationship-based searches, and indexing by relationships. However, current graph databases often fail to create better relationships between data or elements because they rely on a user or administrator to provide a graph model to define the relationships. Often relationships between data are missed or not apparent to the user because of the impossible task of looking through huge data sets. Embodiments discussed herein provide technical improvements to graph databases by applying one or more graph algorithms in a new and useful manner to more accurately detect relationships between data or entities. Thus, embodiments improve the accuracy of a graph model and enable a graph database to return more accurate results in response to queries. Another limitation of the current implementations of graph databases is their ability to scale. Current implementations generally require all the data to be stored on a single server and cannot be scaled beyond a certain point. However, embodiments discussed herein solve this scalability issue by implementing the graph database on a cloud-based system, such as system 106.

As mentioned, system 102 may include any number of servers and may include or be coupled with one or more storage systems, such as storage 108, storage 110, and storage 112. Storages 108, 110, and 112 may be any type of storage, such as a storage array, and include devices such as hard disk drives, solid-state drives, tape drives, or any other type of optical, magnetic, and/or semiconductor storage.

In embodiments, the storages 108, 110, and 112 may store any type of data, such as structured data and unstructured data. In the illustrated example, storage 108 stores structured data. The structured data may be any organized data that conforms to a certain format. For example, the structured data may be highly-organized and formatted in a way that is easily searchable in a relational database. Examples of structured data include numbers, dates, and groups of words and numbers called strings. For example, the structured data may include audit data, prime issue data, human resource data, and so forth.

In some instances, the structured data may already be in a graph model format configured automatically by system 102 or by a user of the system 102. However, the structured data in storage 108 may be in a preliminary format prior to one or more operations or graph algorithms performed on the data to detect additional edges or relationships between the elements and the inclusion of the unstructured data.

In the illustrated example, storage 110 stores unstructured data. The unstructured data may be data that does not have a predefined data model or is not organized in a predefined manner. Unstructured data may be text-heavy, such as data found in books, journals, documents, metadata, health records, audio, video, analog data, images, files, laws, regulations, and organizations. Additional examples include text found in the body of an e-mail message, Web pages, or word-processor documents. With respect to the audit example, the unstructured data may include objects or elements found in laws, regulations, organizations, product descriptions, and people that are related to the audit universe but are described in text descriptions.

In embodiments, the system 102 may include or be coupled with additional storage to store data while performing processing operations and the results of the processing operations. For example, storage 112 can be utilized to store raw data, e.g., structured and/or unstructured data, prior to performing one or more parsing and transformation operations on the data, as will be discussed in more detail below. The storage 112 may also store the results of processed data after one or more operations are performed on the raw data. The processed data may include parsed data and may be in a graph model format.

System 102 may also be coupled with other systems, devices, components, and so forth. In the illustrated example, the system 102 is coupled with system 106 via network 104. The network 104 may be any type of network and include wired and/or wireless connections. In some instances, the network 104 may include the Internet, and the system 106 may be a cloud-based computing system. In one example, system 106 may be an on-demand cloud computing platform and provide functionality for system 102 to perform operations, as discussed herein. For example, the system 106 provides cloud computing web services, including abstract technical infrastructure and distributed computing building blocks and tools. These services may be provided on a virtual cluster of computers, available all the time, and through the Internet. The cluster of computers may provide virtual computers to emulate the components of a real computer, including hardware central processing units (CPUs) and graphics processing units (GPUs), local/RAM memory, hard-disk/SSD storage, one or more operating environments, and pre-loaded application software such as web servers and databases, including graph databases. The system 106 may be implemented as server farms. In some instances, the system 106 may be a third-party system, such as Amazon® Web Services. The third-party system may offer services on a fee-based structure, including models such as “Pay-as-you-go.”

The system 106 may be configured to provide graph database services 120 and one or more graph databases, which may be stored in storage 114. In embodiments, a graph database may be a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. As previously discussed, the advantages of graph databases include the edges, which define relationships between nodes or elements. A graph model may relate data elements to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data elements to be linked together directly and, in many cases, retrieved with one operation or query. In some instances, the graph databases hold relationships between data elements as a priority. Querying relationships is fast because they are perpetually stored in the database. Relationships can be intuitively visualized using graph databases, making them useful for heavily inter-connected data, such as audit data.
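The node, edge, and property structure described above can be pictured with a small in-memory sketch. It is illustrative only: the class name `GraphModel`, its methods, and the sample identifiers are hypothetical and are not part of the described system or of any graph database product.

```python
from collections import defaultdict


class GraphModel:
    """Toy property graph: nodes carry properties, edges carry a label."""

    def __init__(self):
        self.nodes = {}                # node id -> property dict
        self.out = defaultdict(list)   # node id -> [(edge label, target id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, label, dst):
        # The relationship is stored with the data, not computed at query time.
        self.out[src].append((label, dst))

    def neighbors(self, node_id, label=None):
        """Follow stored edges from one node in a single lookup."""
        return [dst for lbl, dst in self.out[node_id] if label is None or lbl == label]


graph = GraphModel()
graph.add_node("I-1", type="issue", name="Late fee disclosure")
graph.add_node("L-1", type="line_of_business", name="Card")
graph.add_node("V-1", type="validation_item", name="FCRA")
graph.add_edge("I-1", "AFFECTS", "L-1")
graph.add_edge("I-1", "GOVERNED_BY", "V-1")
```

Because each edge is stored alongside its source node, retrieving the related elements is a direct lookup rather than a key-matching join.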

In embodiments, the system 106 may be coupled with and/or include storage 114 to store the one or more graph databases. The system 106 may use an underlying storage mechanism to store the graph databases. In one example, the system 106 may store a graph database based on a relational engine and “store” the graph data in a table, with a level of abstraction implemented between the graph database, the graph database management system, and the physical devices (storage 114) where the data is actually stored. In another example, the system 106 may utilize a key-value store or document-oriented database for storage.

In example embodiments, the system 106 may utilize Amazon® Neptune to implement the one or more graph database services 120. The graph database services 120 may utilize a fully managed graph database service, such as Neptune, to build and run applications that work with highly connected datasets, such as the audit data. In embodiments, the system 106 may implement the graph database services 120 and enable read replicas, point-in-time recovery, continuous backups, and replication. In addition, the system 106 may provide data security features and support data encryption at rest and in transit. In a specific example, the system 106 may provide a primary database instance that supports read and write operations and performs all of the data modifications to a cluster volume. In some embodiments, the system 106 may deploy a number of database clusters, and each cluster may have a primary database instance that is responsible for writing (that is, loading or modifying) graph database contents. The system 106 may also connect a replication database, which supports only read operations, to the same storage volume as the primary database instance.

In embodiments, the system 106 also includes a query or search functionality 118 to enable users to query the graph database(s). The system 106 enables users to create interactive graph applications that can query billions of relationships in milliseconds. The search functionality 118 may include open graph application program interfaces (APIs) such as Gremlin® and SPARQL®. SQL queries for highly connected data are complex and hard to tune for performance. The system 106 enables users to build queries that efficiently navigate highly connected datasets.

In some embodiments, the system 106, including the search functionality 118, provides a distributed, multitenant-capable full-text search engine with a Hypertext Transfer Protocol (HTTP) web interface and schema-free JavaScript® Object Notation (JSON) documents, such as Elasticsearch®. The search functionality 118 may include scalable searching, near real-time searching, and support for multitenancy. In some embodiments, the search functionality 118 may be distributed and include indices divided into shards, and each shard can have zero or more replicas. The system 106 may be configured such that each instance or node hosts one or more shards and acts as a coordinator to delegate operations to the correct shard(s). The search functionality 118 may provide automatic rebalancing and routing. In some configurations, system 106 may store related data in the same index, which consists of one or more primary shards and zero or more replica shards.

In embodiments, the search functionality 118 is provided through the JSON and Java API(s), which may be accessed by system 102. Utilizing a distributed, multitenant-capable full-text search engine, such as Elasticsearch®, in an audit universe enables users to uncover risk pervasiveness across multiple lines of business based on the relationships defined by the edges and the ability of the search functionality 118 to traverse the graph database(s) and extrapolate the connections, as will be discussed in more detail below.

In embodiments, the search functionality 118 may provide full-text search integrated into the graph database services 120 utilizing clusters. This may enable users of system 102 to use the search indexing capabilities within the clusters with the graph model or data stored in the graph database(s) to quickly query results. For example, Elasticsearch's built-in text indexing and query capabilities enable customers to run full-text search query types such as match query, intervals query, and query strings. For example, a user may utilize a graph traversal language, such as Gremlin®, using the withSideEffect step and pass the Elasticsearch endpoint, search pattern, and field information to perform a query. However, embodiments are not limited in this manner, and different traversal languages may be utilized to perform queries.

In embodiments, the system 106 may receive a query from the system 102, perform a search utilizing the search functionality 118, and generate a result. The result may include elements related to the query. For example, the search functionality 118 may determine nodes and edges of the graph model associated with the query. The nodes may include elements and may be determined by the relationship identified by connecting edges. The system 106 may feed the results, via a graph traversal language function of search functionality 118, into a web service application(s) 116 to enable a user to visualize the results. The system 106 includes one or more web service application(s) 116, including a graph and data visualization application, such as Tom Sawyer®. The graph and data visualization application may organize the results, including the nodes and edges, such that the user may understand relationships, trends, and patterns. The visualization application includes an API library-based Software Development Kit (SDK) and graphics-based design software (IDE). In some embodiments, the library includes a visualization API, a layout API, and an analysis API, which may be used by a user and system 102 to process the results of a query and present the results in a meaningful way.

FIG. 2 illustrates an example processing flow 200 that may be performed by system 100. Specifically, the operations of processing flow 200 may be performed by system 102 and system 106 to apply graph algorithms to datasets, generate graph models, and store graph models in a graph database(s).

At block 202, the processing flow 200 includes obtaining data. In embodiments, the data may be stored in one or more storage systems, such as storage 108 and storage 110 or data warehouses, and include structured data and unstructured data. The structured data may include data in any organized format. For example, structured data may be data organized in a comma-separated values (CSV) format in a delimited text file in which each line is a data record. In another example, the structured data may be in a JavaScript Object Notation (JSON) format, including attribute-value pairs and array data types. In a third example, the structured data may be organized as text parsed with regular expressions. Embodiments are not limited in this manner. In some instances, at least a portion of the structured data may be configured in accordance with a graph model.
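As a minimal sketch of obtaining structured data in the CSV and JSON formats mentioned above, the following reads both into a common list of records. The field names (`id`, `type`, `name`) and the sample values are assumptions made for illustration; the source does not specify a schema.

```python
import csv
import io
import json

# Hypothetical sample inputs; the field names are illustrative only.
CSV_TEXT = "id,type,name\nI-1,issue,Late fee disclosure\nV-1,validation_item,FCRA\n"
JSON_TEXT = '[{"id": "L-1", "type": "line_of_business", "name": "Card"}]'


def records_from_csv(text):
    """Treat each line of a delimited text file as one data record."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def records_from_json(text):
    """Read an array of attribute-value objects."""
    return list(json.loads(text))


records = records_from_csv(CSV_TEXT) + records_from_json(JSON_TEXT)
```

Normalizing every source into the same record shape keeps the later node and edge generation independent of the input format.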

In embodiments, the obtained data may include unstructured data that may include one or more text segments. The unstructured data does not have a predefined data model or is not organized in a predefined manner. As mentioned, the unstructured data may include objects or elements found in the text of laws, regulations, organizations, product descriptions, and people that are related to the audit universe but are described in text descriptions. In embodiments, the unstructured data may be obtained from a storage location, such as storage 110. However, embodiments are not limited in this manner. In some instances, the unstructured data may be obtained from websites, computer documents, books, journals, documents, metadata, records, audio data, video data, analog data, images, files, and e-mail messages. For example, the system 102 may perform a scrape or crawl of websites associated with laws and regulations to collect text data. The same method may be applied to other sources, such as company documents, recordings of help/complaint chats, internal and external e-mails, and so forth.

In embodiments, the data may include one or more different data types. In one example, the different types of elements may include issues (banking, customer service, legal, compliance, etc.), validation items, account executives (or line of business) and audit entities, such as Matter(s) Requiring Attention (MRAs), online news sources, customer call transcripts, controls, and risks. The issues may be problems or potential problems that may affect other elements, such as an account executive(s), line of business(es) or another entity(ies). The validation items may be rules, regulations, laws, and/or other standards that may apply to the other elements. Examples of validation items may include the Uniform Retail Credit Classification and Account Management (URCCAM), Right To Financial Privacy Act (RFPA), Regulation O—Loans to Executive Officers, Directors, and Principal Shareholders of Member Banks, Interagency Guidance on Credit Card Lending—Account Management and Loss Allowance, Fair Credit Reporting Act (FCRA), US-FTC-Guide Concerning Use of the Word Free and Similar Representations, Interagency Guidance on Authentication in an Internet Banking Environment, Regulation B—Equal Credit Opportunity Act, Consumer Financial Protection Bureau (CFPB)—Phone Pay Fees, Operating Rules of the National Automated Clearing House Association (NACHA), Controlling the Assault of Non-Solicited Pornography And Marketing Act (CAN-SPAM), and Unfair or Deceptive Acts or Practices (UDAP). The account executives may be the leads of specific lines of business. The audit entities may be external business partners.

At block 204, the processing flow 200 may include performing one or more operations or transformations on the data and applying graph algorithms on the datasets. The obtained data may be in a raw data format and may be parsed and transformed by the system 102 into proper nodes and edges for ingestion by the system 106, for example. With respect to the structured data, the system 102 may generate nodes with the data based on the different data types and connection information. Each node may include an element of a specific piece of data and include an identifier or name. In embodiments, the nodes may also be grouped based on the data types. Data of the same data type may be grouped together based on a data group identifier. Thus, when presented to a user, the elements of the same data type may be presented together or at the same level, as illustrated in FIG. 3.
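The node generation and grouping just described might be sketched as follows. The `Node` fields and the record keys are hypothetical; the `data_type` field plays the role of the data group identifier.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str    # identifier of the element
    data_type: str  # data group identifier, e.g. "issue"
    name: str


def build_nodes(records):
    """Turn parsed records into nodes, one node per element."""
    return [Node(r["id"], r["type"], r["name"]) for r in records]


def group_by_type(nodes):
    """Group nodes so elements of one data type can be shown at the same level."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.data_type].append(node)
    return dict(groups)


sample = [
    {"id": "I-1", "type": "issue", "name": "Late fee disclosure"},
    {"id": "I-2", "type": "issue", "name": "Call recording gap"},
    {"id": "V-1", "type": "validation_item", "name": "FCRA"},
]
groups = group_by_type(build_nodes(sample))
```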

In embodiments, the connection information for the structured data may include one or more rules that may be utilized by the system 102 to define relationships between nodes and generate edges. For example, one or more rules may link or define edges between different data types, such as relationships between business issues and different lines of business. In another example, the rules may include user-defined relationships, e.g., a user may utilize an input device to define links between nodes. In embodiments, the relationships between the nodes may also be based on probabilities and similarity analysis, as discussed below. For example, the system 102 may apply probabilistic models to data to determine whether a relationship exists between the nodes. The probabilistic models may be trained utilizing historical data including defined relationships between nodes.
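One way to picture rule-based edge generation is the sketch below, in which each rule names a source data type, a target data type, an edge label, and a predicate. The rule shape, the `lob` field, and the `AFFECTS` label are assumptions introduced for illustration only.

```python
def edges_from_rules(records, rules):
    """Apply each (source type, target type, label, predicate) rule to the records."""
    by_type = {}
    for rec in records:
        by_type.setdefault(rec["type"], []).append(rec)

    edges = []
    for src_type, dst_type, label, predicate in rules:
        for src in by_type.get(src_type, []):
            for dst in by_type.get(dst_type, []):
                if predicate(src, dst):
                    edges.append((src["id"], label, dst["id"]))
    return edges


records = [
    {"id": "I-1", "type": "issue", "lob": "L-1"},
    {"id": "I-2", "type": "issue", "lob": "L-2"},
    {"id": "L-1", "type": "line_of_business"},
    {"id": "L-2", "type": "line_of_business"},
]
# Hypothetical rule: an issue is linked to the line of business it names.
rules = [
    ("issue", "line_of_business", "AFFECTS", lambda s, d: s.get("lob") == d["id"]),
]
edges = edges_from_rules(records, rules)
```

A user-defined relationship could be represented the same way, as a rule whose predicate checks an explicit list of node pairs supplied by the user.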

The system may also process and transform unstructured data. For example, the system 102 may apply a graph model including one or more text or character recognition techniques to determine entities for nodes and connection information to link nodes. In one specific example, the system 102 may determine entities for the nodes by applying a named entity recognition (NER) technique to the text segments of the unstructured data to identify named entities. The NER analysis is an information extraction task that locates and classifies named entities into predefined categories, such as law, organization, product name, etc. The system 102 may also determine connection information for the unstructured data based on the same entities within different segments or documents. For example, the system 102 may create a link between different text segments and entities within the text segments based on having at least one common named entity, as will be discussed in more detail in FIG. 4 and FIG. 5.
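The linking of text segments through a shared named entity can be sketched as below. Note the substitution: a small hand-written gazetteer stands in for a trained NER model, so `find_entities` is a placeholder and not the NER technique itself. The segment texts and identifiers are invented for the example.

```python
from itertools import combinations

# Stand-in for a trained NER model: a tiny hand-written gazetteer that maps
# known entity names to predefined categories.
GAZETTEER = {
    "Fair Credit Reporting Act": "law",
    "CFPB": "organization",
    "NACHA": "organization",
}


def find_entities(text):
    """Return the set of (entity, category) pairs mentioned in the text."""
    return {(name, cat) for name, cat in GAZETTEER.items() if name in text}


def link_segments(segments):
    """Link two segments when they share at least one named entity."""
    found = {seg_id: find_entities(text) for seg_id, text in segments.items()}
    links = []
    for a, b in combinations(sorted(found), 2):
        shared = found[a] & found[b]
        if shared:
            links.append((a, b, sorted(name for name, _ in shared)))
    return links


segments = {
    "doc1": "The CFPB issued guidance on phone pay fees.",
    "doc2": "Complaint escalated to the CFPB under the Fair Credit Reporting Act.",
    "doc3": "NACHA operating rules govern ACH transfers.",
}
links = link_segments(segments)
```

In practice the placeholder would be replaced by a model that also handles spelling variants and context, which simple substring matching does not.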

The system 102 may also determine connection information for the structured data and unstructured data by applying a text similarity analysis. In one example, the system 102 may apply a cosine similarity analysis to each of the structured data and/or unstructured data to measure the similarity in the text to determine connection information and elements of nodes. The one or more edges may be determined for the nodes based on a similarity score above a similarity threshold as a result of the text-similarity analysis. The similarity threshold may be set by an administrator or automatically by system 102 based on a selected number of nodes and/or connections for a given graph database. With respect to the audit example, the system 102 may record or document issues over a period of time, e.g., a number of years. Some of these issues may be raised in different lines of business, share common themes, or involve the same business processes. The system 102 may utilize the text embeddings (sentence embeddings, document embeddings) and cosine similarity to calculate similarities between pairs of issues, and connect these issues in the graph database. The system 102 may also determine similarities between other text-rich nodes, such as issues, audit entity descriptions, controls, processes, news, regulations, etc. One advantage of applying these techniques is providing better connectivity and allowing graph queries to extract themes and emerging issues that are otherwise hard to detect.
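A minimal sketch of thresholded cosine similarity between pairs of issues follows. It assumes a bag-of-words term-count vector as a placeholder for the sentence or document embeddings mentioned above; the issue texts and the 0.5 threshold are invented for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def embed(text):
    """Bag-of-words term counts; a placeholder for a learned text embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_edges(texts, threshold):
    """Create an edge for every pair whose score is above the threshold."""
    vectors = {key: embed(text) for key, text in texts.items()}
    edges = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score > threshold:
            edges.append((a, b, round(score, 3)))
    return edges


issues = {
    "I-1": "late fee disclosure missing",
    "I-2": "fee disclosure missing on statement",
    "I-3": "call recording retention gap",
}
edges = similarity_edges(issues, threshold=0.5)
```

Raising the threshold yields fewer, stronger connections, which is one way an administrator could control the number of edges in a given graph database.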

Embodiments are not limited to utilizing a cosine similarity analysis to determine text similarities in the data. Other examples may include machine-learning techniques including structured learning, by training a model with a data set, or unstructured learning. Other algorithms may include, but are not limited to, Jaccard Similarity, Different embeddings + K-means, Different embeddings + Cosine Similarity, Word2Vec + Smooth Inverse Frequency + Cosine Similarity, Different embeddings + LSI + Cosine Similarity, Different embeddings + LDA + Jensen-Shannon distance, Different embeddings + Word Mover Distance, Different embeddings + Variational Auto Encoder (VAE), Different embeddings + Universal sentence encoder, Different embeddings + Siamese Manhattan LSTM, BERT embeddings + Cosine Similarity, and Knowledge-based Measures.
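As one example from the list above, Jaccard similarity compares two texts by the ratio of shared tokens to all distinct tokens. The whitespace tokenization used here is a simplifying assumption.

```python
def jaccard(text_a, text_b):
    """Size of the token intersection divided by the size of the token union."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


score = jaccard("late fee disclosure missing", "fee disclosure missing on statement")
```

Unlike cosine similarity over term counts, this measure ignores how often a token repeats and depends only on which tokens are present.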

In embodiments, at block 206 processing flow 200 may include running a graph algorithm and generating a graph model. The graph model may be generated based on the identified elements and connection information determined from one or more of the analyses performed. For example, the system 102 may generate one or more nodes and edges file(s) to ingest into system 106 to store as a graph database.
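The nodes and edges files could, for instance, be written as CSV. The source does not state a file format; the sketch below assumes the column headers (`~id`, `~label`, `~from`, `~to`) of the Gremlin CSV load format as documented for Amazon Neptune's bulk loader, and this should be checked against the loader documentation before use. The edge identifiers `e1`, `e2`, and so on are generated for illustration.

```python
import csv
import io


def nodes_csv(nodes):
    """Serialize (id, label, name) tuples as a vertex file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~label", "name:String"])
    for node_id, label, name in nodes:
        writer.writerow([node_id, label, name])
    return buf.getvalue()


def edges_csv(edges):
    """Serialize (source, label, target) tuples as an edge file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for i, (src, label, dst) in enumerate(edges, start=1):
        writer.writerow([f"e{i}", src, dst, label])  # generated edge identifier
    return buf.getvalue()


node_file = nodes_csv([("I-1", "issue", "Late fee"), ("L-1", "line_of_business", "Card")])
edge_file = edges_csv([("I-1", "AFFECTS", "L-1")])
```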

The system 102 may also generate one or more JSON files to ingest for the search functionality 118. As previously discussed, the data may be transformed into structured data and the system 102 may utilize an Amazon service, such as Amazon's Elasticsearch service, Kinesis Data Firehose, Logstash, or CloudWatch Logs, to ingest the data into the search cluster and index the graph database at block 208. The system 102 may enable users to utilize an existing Elasticsearch cluster or create a new one to use with full-text search queries. The unified JSON document structure may store both SPARQL and Gremlin data in Elasticsearch. The search functionality 118 enables users to run full-text search query types such as match query, intervals query, and query strings using extensions to Gremlin and SPARQL queries.

At block 210, the processing flow 200 includes storing the graph model in the storage 114. In one example, the files may be copied from the system 102 to a storage service, such as an Amazon® S3 bucket. An AWS Identity and Access Management (IAM) role may be generated with permissions to make AWS service requests. An Amazon® S3 endpoint may be created to couple with the graph database. A load operation may be performed by sending a request via an HTTP call to a graph database instance, and the graph database instance assumes the IAM role to load the data from the bucket to the graph database. Embodiments are not limited in this manner, and other ingestion techniques may be utilized. In addition, the JSON files may also be ingested into Elasticsearch.
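The load request described above might be assembled as in the sketch below, which only builds the URL and JSON body and does not send anything. The `/loader` path on port 8182 and the body fields (`source`, `format`, `iamRoleArn`, `region`, `failOnError`) reflect the Neptune bulk loader interface as commonly documented and should be verified against current documentation. The host, bucket, and role values are hypothetical placeholders.

```python
import json


def build_load_request(neptune_host, bucket_uri, iam_role_arn, region):
    """Return the URL and JSON body for a bulk load request (not sent here)."""
    url = f"https://{neptune_host}:8182/loader"
    body = {
        "source": bucket_uri,        # bucket location holding the node and edge files
        "format": "csv",             # assumes the files were written as CSV
        "iamRoleArn": iam_role_arn,  # role the database instance assumes to read the bucket
        "region": region,
        "failOnError": "FALSE",
    }
    return url, json.dumps(body)


# Hypothetical placeholder values for illustration only.
url, payload = build_load_request(
    "example-cluster.example.com",
    "s3://example-bucket/graph/",
    "arn:aws:iam::123456789012:role/ExampleLoadRole",
    "us-east-1",
)
```

The payload would then be posted with a `Content-Type: application/json` header by whatever HTTP client the deployment uses.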

FIG. 3 illustrates an example of a graph model 300 configuration. In the illustrated example, graph model 300 may include a number of nodes 302-346 connected via edges illustrating relationships between the nodes. The nodes 302-346 may include elements or text that may convey information to a user of the system. In embodiments, the nodes 302-346 may include different data types presented together. In this example, the elements having the same data type may be presented in the same row, such as indicated by the labels Type A, Type B, Type C, Type D, Type E, and Type F. However, embodiments are not limited in this manner and different visual presentations may be utilized, e.g., the data of the same data type in the same columns.

FIG. 3 illustrates one possible result that may be presented to a user of the system 106 in response to a query. As previously discussed, system 106 may process queries from users and perform searches utilizing the search functionality 118 to generate results. The results may include elements related to the query and connections between the elements. For example, results from the search functionality 118 may be provided to a web service application(s) 116 to enable a user to visualize the results. The web service application(s) 116 may be a graph and data visualization application, such as Tom Sawyer®, and/or built utilizing an application, such as React, Vue, or Angular. The graph and data visualization application may organize the results including the nodes and edges, such that the user may understand relationships, trends, and patterns.

Initially, the system 100 including the web service application(s) 116 may present all of the results including all of the entities in a graphical user interface (GUI). The web service application(s) 116 may enable users to manipulate the data, focus on specific information, narrow down results to more applicable results, and so forth. These operations may be performed via one or more user inputs and displayed in the GUI. In some instances, the web service application(s) 116 enables the end-users to tailor the results presented to them such that they can see the relationships between the nodes of different data types. For example, a user may be able to focus on a particular data type or set a threshold for a number of data elements in common with another element. The user may utilize a user input, such as a slider bar, to set a number of common entities to a specific entity or data type that must exist for entities to be presented. For example, if a user sets the common elements threshold to ten for a given element or elements of a particular data type, only elements including entities having ten or more entities in common with the other entity or entities will be presented. With respect to the audit example, a user may only want to audit entities having ten or more validation items in common and relating to business issues. A user may utilize the results to determine an owner, such as an account executive or line of business relating to the validation items and issues that may possibly affect the line of business.

FIG. 4 illustrates an example processing flow 400 that may be performed by the systems discussed herein to process unstructured data to determine entities and connection information. The illustrated processing flow 400 may be performed by the system 102 and includes applying a named entity recognition (NER) technique.

At block 402, the processing flow 400 includes obtaining the unstructured data. For example, the system 102 may retrieve the data from a storage 110. The unstructured data may be collected on an ongoing basis and include data from computer documents, electronic mail (e-mail), chat conversations, and so forth. In other instances, the unstructured data may be obtained from other sources, such as websites, as previously discussed. For example, the system 102 may periodically scrape and collect data from different websites that relate to rules and regulations.

At block 404, the processing flow 400 includes analyzing the data. For example, the system 102 processes the unstructured data. Processing the data may include collecting the data at a central location, such as storage 110, and putting the data into a format such that the extraction operation may be performed. At block 406, the processing flow 400 includes determining common named entities for elements. For example, a NER operation may locate and classify each named entity in each piece of unstructured data into predefined categories, such as law, organization, product name, etc. The NER operation may be performed on all of the unstructured data collected and stored in storage 110.

At block 408, the processing flow 400 includes determining connection information for each of the named entities discovered. The connection information may link entities within common unstructured data. For example, the system 102 may link entities or generate connection information for entities from different data sources, e.g., documents, e-mails, websites, etc., when they have common named entities, as illustrated in FIG. 5. The connection information may be determined based on entities having a number of common named entities above a threshold value, as will be discussed in the example of FIG. 5.

FIG. 5 illustrates an example data diagram 500 based on a result of a NER operation performed on unstructured data. The illustrated example only includes unstructured data relating to two different elements, element 502 and element 510; however, embodiments are not limited in this manner. In embodiments, the NER operations may be performed to detect any number of entities within unstructured data. As previously discussed and with respect to the audit example, there are certain elements, such as laws, regulations, organizations, and people, that are tied to the audit universe but are only described in text-based descriptions in unstructured data. The NER operation can detect and retrieve the elements that are mentioned in the text and then model them into the audit universe automatically. These entities provide fine-grained information in an audit.

In the illustrated example of FIG. 5, element 502 may be an issue and be associated with unstructured data 504, e.g., an issue described in an e-mail or documents. Element 510 may be another issue and may be associated with different unstructured data 508. The system 102 may perform the NER operation on the unstructured data 504 and unstructured data 508 to identify common entities. In FIG. 5, the list of named entities 506 includes all of the named entities found in both sets of the unstructured data 504 and unstructured data 508. The NER operation may further identify named entities that are common to the unstructured data 504 and the unstructured data 508. In the illustrated example, both data sets include BB and DD. The NER operation may determine whether the number of common entities equals (or exceeds) a threshold value. For example, a threshold value may be set to two, and in this case, the NER operation may determine that the threshold value is met. Based on the threshold value being met, a link or connection may be generated between element 502 and element 510. However, if the threshold value is not met, a link will not be generated. Note that different logic may be utilized, and embodiments are not limited in this manner. Further, the threshold may be user configured or adjusted by system 102.
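The threshold logic of FIG. 5 may be sketched, purely for illustration, as a set intersection over the named entities already extracted for each element. The function name link_by_common_entities and the entities AA and CC are hypothetical; only BB and DD are identified above as common to both data sets, and the NER extraction itself is assumed to have been performed elsewhere.

```python
from itertools import combinations


def link_by_common_entities(entities_by_element, threshold=2):
    """Return (element_a, element_b, common_entities) for each pair meeting the threshold."""
    links = []
    for a, b in combinations(sorted(entities_by_element), 2):
        common = set(entities_by_element[a]) & set(entities_by_element[b])
        # A link is generated only when the number of common entities
        # equals or exceeds the threshold value.
        if len(common) >= threshold:
            links.append((a, b, sorted(common)))
    return links


# Hypothetical NER output for the two elements of FIG. 5.
ner_output = {
    "element_502": {"AA", "BB", "DD"},
    "element_510": {"BB", "CC", "DD"},
}

links = link_by_common_entities(ner_output, threshold=2)
no_links = link_by_common_entities(ner_output, threshold=3)
```

With a threshold of two, the pair is linked through BB and DD; raising the threshold to three yields no link, which corresponds to the case in which the threshold value is not met.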

The system 102 may run a NER operation on the unstructured data and detect all of the named entities within the data. The system 102 may further generate connection information for the unstructured data, including determining links between elements based on having common named entities. The elements and connection information may be further processed with the structured data to generate a graph model for storage in a graph database configuration. FIGS. 6 and 7 discuss additional operations that may be performed to detect additional connection information between entities.

FIG. 6 illustrates an example processing flow 600 that may be performed to generate additional connection information to define edges between nodes and entities based on text similarities. In some instances, data may include a number of entities that, although different, are related in some manner. For example, an audit may include data that identifies a number of issues. Some of these issues may be raised in different lines of business but share common themes or involve the same business processes. Embodiments may include applying text similarity operations on the data to identify connections. For example, system 102 may apply embeddings (sentence embeddings, document embeddings) and a similarity algorithm to calculate similarities between pairs of issues and connect these issues in the graph database. These techniques provide better connectivity and allow graph queries to extract themes and emerging issues that are otherwise hard to detect.

At block 602, the processing flow 600 includes obtaining data including elements. The data may be collected from one or more of the structured data, the unstructured data, or a combination thereof. At block 604, the processing flow 600 includes applying a text similarity analysis to the data. In one example, the system 102 may apply a cosine similarity analysis to each of the structured data and/or unstructured data to measure the similarity in the text to determine connection information and elements of nodes at block 606. The one or more edges may be determined for the nodes based on a similarity score above a similarity threshold as a result of the text-similarity analysis. Embodiments are not limited to utilizing a cosine similarity analysis to determine text similarities in the data. Other examples may include machine-learning techniques including structured learning by training a model with a data set or unstructured learning. Other algorithms may include, but are not limited to, Jaccard Similarity, Different embeddings + K-means, Different embeddings + Cosine Similarity, Word2Vec + Smooth Inverse Frequency + Cosine Similarity, Different embeddings + LSI + Cosine Similarity, Different embeddings + LDA + Jensen-Shannon distance, Different embeddings + Word Mover Distance, Different embeddings + Variational Auto Encoder (VAE), Different embeddings + Universal sentence encoder, Different embeddings + Siamese Manhattan LSTM, BERT embeddings + Cosine Similarity, and Knowledge-based Measures.
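As one hedged illustration of blocks 602 through 606, the following sketch uses Jaccard similarity, one of the alternative algorithms listed above, over whitespace-separated tokens. The issue identifiers, issue texts, threshold of 0.5, and function names are hypothetical, and a practical embodiment may instead rely on embeddings or a trained model.

```python
from itertools import combinations


def jaccard_similarity(text_a, text_b):
    """Jaccard similarity: size of the token intersection divided by size of the token union."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_edges(texts_by_element, threshold):
    """Return (element_a, element_b, score) for each pair scoring at or above the threshold."""
    edges = []
    for a, b in combinations(sorted(texts_by_element), 2):
        score = jaccard_similarity(texts_by_element[a], texts_by_element[b])
        if score >= threshold:
            edges.append((a, b, score))
    return edges


# Hypothetical issue descriptions (block 602).
issues = {
    "issue_1": "late vendor payment approval",
    "issue_2": "vendor payment approval missing",
    "issue_3": "password rotation policy",
}

# Blocks 604 and 606: score each pair and keep edges above the similarity threshold.
edges = similarity_edges(issues, threshold=0.5)
```

Here issue_1 and issue_2 share three of five distinct tokens, so a single edge is produced between them, while issue_3 remains unconnected.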

FIG. 7 illustrates an example similarity analysis diagram 700 that includes the results of a similarity analysis performed. In the illustrated similarity analysis diagram 700, only a limited number of elements and connections are illustrated for explanation purposes, and embodiments include applying a similarity analysis on any number of elements.

In FIG. 7, element 702 may be connected with additional elements A, B, and C, and element 704 may be connected with additional elements D, E, and F. In one example, element 702 may represent a first line of business and elements A, B, and C may be different issues that affect the first line of business. Element 704 represents a second line of business and elements D, E, and F may be different issues that affect the second line of business.

The system 102 may perform the similarity analysis and indicate how similar (or not similar) elements are to each other. For example, a cosine similarity analysis measures the similarity between two vectors of an inner product space. The similarity is measured by the cosine of the angle between the two vectors and indicates whether the two vectors are pointing in roughly the same direction. Cosine similarity is used to measure document similarity in text analysis, such as the similarity of text from different elements.

A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document. Thus, each document is an object represented by what is called a document embedding vector. In embodiments, the cosine similarity may determine the words that the elements have in common, and the occurrence frequency of such words. The analysis may indicate when the cosine of the angle between the text embedding vectors of two different elements is above a threshold value. In the illustrated example, the similarity score between elements A and D (and B and C), as indicated by the dashed line, may be above the threshold value and a connection between A and D (and B and C) may be made. Embodiments are not limited to applying the cosine similarity algorithm and other algorithms may be used, as previously discussed. Once all of the nodes including the elements and connections are determined, the graph database may be generated. Users then may query the graph database to determine relationships between elements, as will be discussed in more detail in FIGS. 8 and 9A/9B.
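A minimal sketch of the word-frequency form of the cosine similarity is given below. It assumes simple whitespace tokenization and raw term counts as the document vector; the issue texts attributed to elements A and D, the threshold of 0.5, and the function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Represent a document as a vector of word counts."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(tf_a[word] * tf_b[word] for word in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical text for element A (under element 702) and element D (under element 704).
tf_a = term_frequencies("access review not completed")
tf_d = term_frequencies("access review overdue")

score = cosine_similarity(tf_a, tf_d)
connect_a_and_d = score >= 0.5  # hypothetical threshold value
```

The two texts share two words, giving a dot product of 2 and vector norms of 2 and the square root of 3, so the score is roughly 0.58 and a connection between A and D would be made under the assumed threshold.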

FIG. 8 illustrates an example of a processing flow 800 that may be performed by a system discussed herein to process the query of a graph database. For example, the system 106 may receive a query from a user and return a result of the query.

At block 802, the processing flow 800 includes receiving and processing a query for information in a graph database. The query may indicate an element of a node. For example, a user may submit a search for a particular issue to determine other elements related to the issue, e.g., different lines of businesses, audit entities that might be impacted by the selected issue, etc.

At block 804, the processing flow 800 includes accessing the graph model and information stored in the graph database. Further and at block 806, the processing flow 800 may determine related elements. In some instances, the system 106 may determine all elements having at least one connection or relationship with the query. In this example, the system 106 may return every element that is connected to the query by one connection as a result to the query. In another example, a user may set a threshold value indicating a number of connections/relationships required. For example, a user may set the threshold value to three, and only elements having three or more connections (direct or indirect) with the element of the query may be returned. A direct connection may be one connected directly with the query and indirect connections may be through another element.
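One possible reading of the connection threshold at block 806 is sketched below, under the assumption that a direct connection is an edge to the query element and an indirect connection is a two-step path through one intermediate element. The edge list, the intermediate element identifiers 904, 906, and 908, and the function names are hypothetical; other counting logic may equally be utilized.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (element, element) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def connection_counts(adjacency, query):
    """Count direct edges plus two-step paths from the query element to every other element."""
    counts = defaultdict(int)
    for neighbor in adjacency[query]:
        counts[neighbor] += 1  # direct connection
    for intermediate in adjacency[query]:
        for element in adjacency[intermediate]:
            if element != query:
                counts[element] += 1  # indirect connection through another element
    return dict(counts)


def related_elements(adjacency, query, threshold=1):
    """Return the elements having at least `threshold` connections with the query element."""
    counts = connection_counts(adjacency, query)
    return sorted(element for element, count in counts.items() if count >= threshold)


# Hypothetical graph: the query element reaches one element by three paths and another by two.
graph = build_adjacency([
    ("902", "904"), ("902", "906"), ("902", "908"),
    ("904", "918"), ("906", "918"), ("908", "918"),
    ("904", "920"), ("906", "920"),
])

at_least_three = related_elements(graph, "902", threshold=3)
at_least_two = related_elements(graph, "902", threshold=2)
```

Under these assumptions, a threshold of three returns only the element reached by three paths, while a threshold of two returns both of the indirectly connected elements.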

In some instances, a user may query the system to return results having a number of connections above a threshold value for a particular data type. For example, a user may submit a query to show issues and only want the results to include elements having three or more impacted audit entities. Thus, the user may focus on and determine issues that have a high level of risk pervasiveness. Once the results are determined, the processing flow 800 includes presenting the related elements at block 808, e.g., in a web service application(s) 116 such as a data visualization application.
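The data-type-specific threshold may be illustrated with the following sketch, which counts, for each element of a requested type, the distinct neighbors of a second type. The node identifiers, the type labels "issue", "audit_entity", and "line_of_business", and the function name are hypothetical and are used only to show the filtering step.

```python
from collections import defaultdict


def elements_by_typed_connections(node_types, edges, result_type, counted_type, minimum):
    """Return (element, count) for elements of result_type having at least
    `minimum` distinct neighbors of counted_type."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    selected = []
    for node, node_type in node_types.items():
        if node_type != result_type:
            continue
        count = sum(1 for n in neighbors[node] if node_types.get(n) == counted_type)
        if count >= minimum:
            selected.append((node, count))
    return sorted(selected)


# Hypothetical audit data: two issues, three audit entities, one line of business.
node_types = {
    "issue_1": "issue",
    "issue_2": "issue",
    "entity_a": "audit_entity",
    "entity_b": "audit_entity",
    "entity_c": "audit_entity",
    "lob_x": "line_of_business",
}
edges = [
    ("issue_1", "entity_a"), ("issue_1", "entity_b"), ("issue_1", "entity_c"),
    ("issue_1", "lob_x"),
    ("issue_2", "entity_a"), ("issue_2", "lob_x"),
]

pervasive_issues = elements_by_typed_connections(
    node_types, edges, result_type="issue", counted_type="audit_entity", minimum=3
)
```

Only issue_1 impacts three audit entities in this example, so it alone is returned; connections to the line of business are not counted because they are of a different data type.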

FIG. 9 illustrates an example of results 900 that may be returned in response to a query submitted by a user. In this illustrated example, a user may submit a query as indicated by element 902. The system 106 may receive the query and determine elements that are related to element 902 by at least one connection either directly or indirectly.

As discussed above in FIG. 8, in some instances a user may desire to focus on elements of a particular data type related to the query element 902. For example, elements 918 and 920 may represent different lines of business or different leaders. In one example, a user may only want the result to include elements of the particular data type (lines of business or leadership) that have more than two connections. In this example, only element 918 may be returned to the user since element 920 does not have more than two connections with element 902, either directly or indirectly.

However, in another example, a user may want to see results with elements of the particular data type having more than one connection. In this example, the system 106 may present both elements 918 and 920 since they both have at least two connections with the query element. Embodiments are not limited to the illustrated example, and different results may be generated and returned based on any number of user and/or system settings.

FIG. 10 illustrates an embodiment of an exemplary computer architecture 1000 suitable for implementing various embodiments as previously described. In one embodiment, the computer architecture 1000 may include or be implemented as part of systems 102 and 106.

As used in this application, the terms “system” and “component” are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution, examples of which are provided by the exemplary computer architecture 1000. For example, a component can be, but is not limited to being, a process running on a processor, a processor, a hard disk drive, multiple storage drives (of optical and/or magnetic storage medium), an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a component. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. Further, components may be communicatively coupled to each other by various types of communications media to coordinate operations. The coordination may involve the uni-directional or bi-directional exchange of information. For instance, the components may communicate information in the form of signals communicated over the communications media. The information can be implemented as signals allocated to various signal lines. In such allocations, each message is a signal. Further embodiments, however, may alternatively employ data messages. Such data messages may be sent across various connections. Exemplary connections include parallel interfaces, serial interfaces, and bus interfaces.

The computer architecture 1000 includes various common computing elements, such as one or more processors, multi-core processors, co-processors, memory units, chipsets, controllers, peripherals, interfaces, oscillators, timing devices, video cards, audio cards, multimedia input/output (I/O) components, power supplies, and so forth. The embodiments, however, are not limited to implementation by the computer architecture 1000.

As shown in FIG. 10, the computer architecture 1000 includes a processor 1012, a system memory 1004 and a system bus 1006. The processor 1012 can be any of various commercially available processors.

The system bus 1006 provides an interface for system components including, but not limited to, the system memory 1004 to the processor 1012. The system bus 1006 can be any of several types of bus structure that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and a local bus using any of a variety of commercially available bus architectures. Interface adapters may connect to the system bus 1006 via slot architecture. Example slot architectures may include without limitation Accelerated Graphics Port (AGP), Card Bus, (Extended) Industry Standard Architecture ((E)ISA), Micro Channel Architecture (MCA), NuBus, Peripheral Component Interconnect (Extended) (PCI(X)), PCI Express, Personal Computer Memory Card International Association (PCMCIA), and the like.

The computer architecture 1000 may include or implement various articles of manufacture. An article of manufacture may include a computer-readable storage medium to store logic. Examples of a computer-readable storage medium may include any tangible media capable of storing electronic data, including volatile memory or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, writeable or re-writeable memory, and so forth. Examples of logic may include executable computer program instructions implemented using any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, object-oriented code, visual code, and the like. Embodiments may also be at least partly implemented as instructions contained in or on a non-transitory computer-readable medium, which may be read and executed by one or more processors to enable performance of the operations described herein.

The system memory 1004 may include various types of computer-readable storage media in the form of one or more higher speed memory units, such as read-only memory (ROM), random-access memory (RAM), dynamic RAM (DRAM), Double-Data-Rate DRAM (DDRAM), synchronous DRAM (SDRAM), static RAM (SRAM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, polymer memory such as ferroelectric polymer memory, ovonic memory, phase change or ferroelectric memory, silicon-oxide-nitride-oxide-silicon (SONOS) memory, magnetic or optical cards, an array of devices such as Redundant Array of Independent Disks (RAID) drives, solid state memory devices (e.g., USB memory, solid state drives (SSD)), and any other type of storage media suitable for storing information. In the illustrated embodiment shown in FIG. 10, the system memory 1004 can include non-volatile memory 1008 and/or volatile memory 1010. A basic input/output system (BIOS) can be stored in the non-volatile memory 1008.

The computer 1002 may include various types of computer-readable storage media in the form of one or more lower speed memory units, including an internal (or external) hard disk drive 1030, a magnetic disk drive 1016 to read from or write to a removable magnetic disk 1020, and an optical disk drive 1028 to read from or write to a removable optical disk 1032 (e.g., a CD-ROM or DVD). The hard disk drive 1030, magnetic disk drive 1016 and optical disk drive 1028 can be connected to the system bus 1006 by an HDD interface 1014, an FDD interface 1018 and an optical disk drive interface 1034, respectively. The HDD interface 1014 for external drive implementations can include at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies.

The drives and associated computer-readable media provide volatile and/or nonvolatile storage of data, data structures, computer-executable instructions, and so forth. For example, a number of program modules can be stored in the drives and the non-volatile 1008 and volatile 1010 memory, including an operating system 1022, one or more applications 1042, other program modules 1024, and program data 1026. In one embodiment, the one or more applications 1042, other program modules 1024, and program data 1026 can include, for example, the various applications and/or components of the system 100.

A user can enter commands and information into the computer 1002 through one or more wire/wireless input devices, for example, a keyboard 1050 and a pointing device, such as a mouse 1052. Other input devices may include microphones, infra-red (IR) remote controls, radio-frequency (RF) remote controls, game pads, stylus pens, card readers, dongles, finger print readers, gloves, graphics tablets, joysticks, keyboards, retina readers, touch screens (e.g., capacitive, resistive, etc.), trackballs, track pads, sensors, styluses, and the like. These and other input devices are often connected to the processor 1012 through an input device interface 1036 that is coupled to the system bus 1006 but can be connected by other interfaces such as a parallel port, IEEE 1394 serial port, a game port, a USB port, an IR interface, and so forth.

A monitor 1044 or other type of display device is also connected to the system bus 1006 via an interface, such as a video adapter 1046. The monitor 1044 may be internal or external to the computer 1002. In addition to the monitor 1044, a computer typically includes other peripheral output devices, such as speakers, printers, and so forth.

The computer 1002 may operate in a networked environment using logical connections via wire and/or wireless communications to one or more remote computers, such as a remote computer(s) 1048. The remote computer(s) 1048 can be a workstation, a server computer, a router, a personal computer, portable computer, microprocessor-based entertainment appliance, a peer device or other common network node, and typically includes many or all the elements described relative to the computer 1002, although, for purposes of brevity, only a memory and/or storage device 1058 is illustrated. The logical connections depicted include wire/wireless connectivity to a local area network 1056 and/or larger networks, for example, a wide area network 1054. Such LAN and WAN networking environments are commonplace in offices and companies, and facilitate enterprise-wide computer networks, such as intranets, all of which may connect to a global communications network, for example, the Internet.

When used in a local area network 1056 networking environment, the computer 1002 is connected to the local area network 1056 through a wire and/or wireless communication network interface or network adapter 1038. The network adapter 1038 can facilitate wire and/or wireless communications to the local area network 1056, which may also include a wireless access point disposed thereon for communicating with the wireless functionality of the network adapter 1038.

When used in a wide area network 1054 networking environment, the computer 1002 can include a modem 1040, or is connected to a communications server on the wide area network 1054 or has other means for establishing communications over the wide area network 1054, such as by way of the Internet. The modem 1040, which can be internal or external and a wire and/or wireless device, connects to the system bus 1006 via the input device interface 1036. In a networked environment, program modules depicted relative to the computer 1002, or portions thereof, can be stored in the remote memory and/or storage device 1058. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers can be used.

The computer 1002 is operable to communicate with wire and wireless devices or entities using the IEEE 802 family of standards, such as wireless devices operatively disposed in wireless communication (e.g., IEEE 802.11 over-the-air modulation techniques). This includes at least Wi-Fi (or Wireless Fidelity), WiMax, and Bluetooth™ wireless technologies, among others. Thus, the communication can be a predefined structure as with a conventional network or simply an ad hoc communication between at least two devices. Wi-Fi networks use radio technologies called IEEE 802.11 (a, b, g, n, etc.) to provide secure, reliable, fast wireless connectivity. A Wi-Fi network can be used to connect computers to each other, to the Internet, and to wire networks (which use IEEE 802.3-related media and functions).

The various elements of the devices as previously described with reference to FIGS. 1-9 may include various hardware elements, software elements, or a combination of both. Examples of hardware elements may include devices, logic devices, components, processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), memory units, logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth. Examples of software elements may include software components, programs, applications, computer programs, application programs, system programs, software development programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. However, determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints, as desired for a given implementation.

FIG. 11 is a block diagram depicting an exemplary communications architecture 1100 suitable for implementing various embodiments as previously described. The communications architecture 1100 includes various common communications elements, such as a transmitter, receiver, transceiver, radio, network interface, baseband processor, antenna, amplifiers, filters, power supplies, and so forth. The embodiments, however, are not limited to implementation by the communications architecture 1100, which may be consistent with system 100.

As shown in FIG. 11, the communications architecture 1100 includes one or more client(s) 1102 and server(s) 1104. The server(s) 1104 may implement one or more devices of system 100. The client(s) 1102 and the server(s) 1104 are operatively connected to one or more respective client data store 1106 and server data store 1108 that can be employed to store information local to the respective client(s) 1102 and server(s) 1104, such as cookies and/or associated contextual information.

The client(s) 1102 and the server(s) 1104 may communicate information between each other using a communication framework 1110. The communication framework 1110 may implement any well-known communications techniques and protocols. The communication framework 1110 may be implemented as a packet-switched network (e.g., public networks such as the Internet, private networks such as an enterprise intranet, and so forth), a circuit-switched network (e.g., the public switched telephone network), or a combination of a packet-switched network and a circuit-switched network (with suitable gateways and translators).

The communication framework 1110 may implement various network interfaces arranged to accept, communicate, and connect to a communications network. A network interface may be regarded as a specialized form of an input/output (I/O) interface. Network interfaces may employ connection protocols including without limitation direct connect, Ethernet (e.g., thick, thin, twisted pair 10/100/1000 Base T, and the like), token ring, wireless network interfaces, cellular network interfaces, IEEE 802.11a-x network interfaces, IEEE 802.16 network interfaces, IEEE 802.11 network interfaces, and the like. Further, multiple network interfaces may be used to engage with various communications network types. For example, multiple network interfaces may be employed to allow for the communication over broadcast, multicast, and unicast networks. Should processing requirements dictate a greater amount of speed and capacity, distributed network controller architectures may similarly be employed to pool, load balance, and otherwise increase the communicative bandwidth required by client(s) 1102 and the server(s) 1104. A communications network may be any one or a combination of wired and/or wireless networks including without limitation a direct interconnection, a secured custom connection, a private network (e.g., an enterprise intranet), a public network (e.g., the Internet), a Personal Area Network (PAN), a Local Area Network (LAN), a Metropolitan Area Network (MAN), an Operating Missions as Nodes on the Internet (OMNI), a Wide Area Network (WAN), a wireless network, a cellular network, and other communications networks.

The components and features of the devices described above may be implemented using any combination of discrete circuitry, application specific integrated circuits (ASICs), logic gates and/or single chip architectures. Further, the features of the devices may be implemented using microcontrollers, programmable logic arrays and/or microprocessors or any combination of the foregoing where suitably appropriate. It is noted that hardware, firmware and/or software elements may be collectively or individually referred to herein as “logic” or “circuit.”

What is claimed is:
 1. A computer-implemented method, comprising: receiving, by at least one processor, a query for retrieving data stored in a graph database, the data being stored using one or more graph data models having a plurality of elements connected using a plurality of connections, the query identifying at least one first element in the plurality of elements stored in the graph database for retrieval; executing, by the at least one processor, a similarity detection to identify one or more second elements in the plurality of elements related to the at least one first element and one or more connections in the plurality of connections associated with at least one of the one or more second elements and the at least one first element; selecting, by the at least one processor, at least one second element in the one or more second elements and at least one connection in the identified one or more connections responsive to the query; and outputting, by the at least one processor, the at least one first element and the selected at least one second element.
 2. The method according to claim 1, wherein the executing includes training at least one model to identify the one or more connections in the plurality of connections.
 3. The method according to claim 2, wherein the at least one model is trained using data associated with one or more historical connections between one or more elements in the plurality of elements.
 4. The method according to claim 1, wherein the graph database is configured to store at least one of the following: a structured data, an unstructured data, and any combination thereof.
 5. The method according to claim 1, wherein the similarity detection identifying the one or more second elements includes at least one of the following similarities: an element name similarity, an element type similarity, an element text similarity, and any combination thereof.
 6. The method according to claim 5, wherein the similarities are detected between at least one of the following: the at least one first element and the one or more second elements, the one or more second elements, at least another element in the plurality of elements and at least one of the at least one first element and the one or more second elements, and any combination thereof.
 7. The method according to claim 1, wherein the similarity detection is executed using at least one of the following: a structured machine learning by training one or more models with a data set of elements, an unstructured learning, and any combinations thereof.
 8. The method according to claim 1, wherein the selecting includes selecting the at least one second element in the one or more second elements based on a predetermined number of connections associated with at least one of: the at least one second element, the at least one first element, and any combinations thereof.
 9. The method according to claim 8, wherein the received query identifies the predetermined number of connections.
 10. The method according to claim 1, wherein the selecting includes selecting a predetermined number of second elements in the one or more second elements.
 11. The method according to claim 10, wherein the received query identifies the predetermined number of second elements.
 12. The method according to claim 1, wherein the one or more connections are identified based on at least one of the following: the identified one or more second elements, the at least one first element, and any combination thereof.
 13. The method according to claim 1, wherein the one or more connections include at least one of the following: a direct connection, an indirect connection, and any combination thereof.
 14. A system, comprising: at least one processor; and at least one non-transitory storage media storing instructions, that when executed by the at least one processor, cause the at least one processor to perform operations including receiving a query for retrieving data stored in a graph database, the data being stored using one or more graph data models having a plurality of elements connected using a plurality of connections, the query identifying at least one first element in the plurality of elements stored in the graph database for retrieval; training at least one model to identify one or more connections in the plurality of connections associated with at least one of: one or more second elements in the plurality of elements related to the at least one first element, and the at least one first element; selecting at least one second element in the one or more second elements and at least one connection in the identified one or more connections responsive to the query; and outputting the at least one first element and the selected at least one second element.
 15. The system according to claim 14, wherein the at least one model is trained using data associated with one or more historical connections between one or more elements in the plurality of elements.
 16. The system according to claim 14, wherein the graph database is configured to store at least one of the following: a structured data, an unstructured data, and any combination thereof.
 17. The system according to claim 14, wherein the one or more second elements are identified using at least one of the following similarities: an element name similarity, an element type similarity, an element text similarity, and any combination thereof; wherein the similarities are detected between at least one of the following: the at least one first element and the one or more second elements, the one or more second elements, at least another element in the plurality of elements and at least one of the at least one first element and the one or more second elements, and any combination thereof.
 18. The system according to claim 14, wherein the selecting includes selecting the at least one second element in the one or more second elements based on a predetermined number of connections associated with at least one of: the at least one second element, the at least one first element, and any combinations thereof, wherein the received query identifies the predetermined number of connections.
 19. The system according to claim 14, wherein the selecting includes selecting a predetermined number of second elements in the one or more second elements, wherein the received query identifies the predetermined number of second elements.
 20. A computer program product comprising a non-transitory machine-readable medium storing instructions that, when executed by at least one programmable processor, cause the at least one programmable processor to perform operations comprising: receiving, by at least one processor, a query for retrieving data stored in a graph database, the data being stored using one or more graph data models having a plurality of elements connected using a plurality of connections, the query identifying at least one first element in the plurality of elements stored in the graph database for retrieval; executing, by the at least one processor, a similarity detection to identify one or more second elements in the plurality of elements related to the at least one first element and one or more connections in the plurality of connections associated with at least one of the one or more second elements and the at least one first element, wherein the similarities are detected between at least one of the following: the at least one first element and the one or more second elements, the one or more second elements, at least another element in the plurality of elements and at least one of the at least one first element and the one or more second elements, and any combination thereof; selecting, by the at least one processor, at least one second element in the one or more second elements and at least one connection in the identified one or more connections responsive to the query; and outputting, by the at least one processor, the at least one first element and the selected at least one second element.